
LABORATORY FOR COMPUTER SCIENCE
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

MIT/LCS/TR-535

ON THE SAMPLE COMPLEXITY OF PAC-LEARNING
USING RANDOM AND CHOSEN EXAMPLES

Bronwyn Bonnie Eisenberg

March 1992

545 TECHNOLOGY SQUARE, CAMBRIDGE, MASSACHUSETTS 02139



On the Sample Complexity of Pac-learning using 
Random and Chosen Examples 

by 
Bronwyn Bonnie Eisenberg 

B.A., English 

Princeton University 

(1981) 

Submitted to the 

Department of Electrical Engineering and Computer Science 

in partial fulfillment of the requirements for the degree of 

Master of Science in Electrical Engineering and Computer Science 

at the 

Massachusetts Institute of Technology 
May 1991 

© Massachusetts Institute of Technology 1991 
Signature of Author 



Department of Electrical Engineering and Computer Science 

March 1, 1992 



Certified by 



Ronald L. Rivest 
Professor of Computer Science and Engineering 

Thesis Supervisor 



Accepted by 



Arthur C. Smith 
Chairman, Department Committee on Graduate Students 



Abstract 

Two protocols used for learning under the pac-learning model introduced by Valiant
are learning from random examples and learning from membership queries. Membership
queries have also been used to efficiently and exactly learn a concept class C that is too
difficult to pac-learn using random examples. We ask whether using membership queries
(in conjunction with or instead of random examples) can serve a new purpose: helping
to reduce the total number of examples needed to pac-learn a concept class C already
known to be pac-learnable using just random examples. We focus on concept classes that
are dense in themselves, such as half-spaces of R^n, rectangles in the plane, and the class
I = {[0, a] : 0 ≤ a ≤ 1} of initial segments of [0, 1].

The main results of this thesis are:

1. Adding the option of using membership queries cannot significantly reduce the total
number of examples needed to pac-learn a dense-in-itself concept class C;
Ω((1/ε) ln(1/δ)) random examples are still required to pac-learn C, where ε and δ are
the usual pac-learning parameters. Interestingly, this bound holds for any unknown
probability distribution, unlike the "standard" proof (due to Ehrenfeucht, Haussler,
Kearns, and Valiant [12]), which holds only for a particular distribution constructed
by an adversary.

2. Even if the unknown probability distribution is known to be "smooth", at least
(1/(4ε)) ln(1/δ) examples are required to pac-learn C from random examples (only).

3. For the special case of learning half-spaces of an n-dimensional unit simplex, if the
probability distribution is smooth and examples can be chosen as well as drawn ran-
domly, then the number of examples required to pac-learn decreases significantly, to
n^2(lg(s/ε) + 4), and the computational complexity is O(n^7 s/ε).

Portions of this thesis are joint work with Ron Rivest. 
Thesis Supervisor: Ronald L. Rivest 

Title: Professor of Electrical Engineering and Computer Science 
Keywords: Pac-learning, Queries, Drawing examples, Choosing examples 



Contents 



1 Introduction
1.1 Overview
1.2 Standard Machine Learning Definitions
1.3 Pac-learning Definition
1.3.1 Pac-learning error rate
1.3.2 Pac-learning complexity
1.3.3 Protocols for obtaining examples
1.4 New Definitions
1.4.1 Dense-in-itself concept classes
1.4.2 Smooth probability distributions
1.5 Pac-learning Background
1.6 The Results of this Thesis

2 A Lower Bound When Both Drawing and Choosing Examples
2.1 A Lower Bound
2.2 Comparison of the Lower Bound to Previous Results

3 Smoothness Guarantees
3.1 A Lower Bound When Drawing Examples Only
3.2 Upper and Lower Bounds When Both Drawing and Choosing Examples
3.2.1 Using the binary-search approach in one dimension
3.2.2 Generalizing the binary-search approach to n dimensions




Chapter 1 

Introduction 



1.1 Overview 

In 1984, Valiant published A Theory of the Learnable [18]. His paper introduced a model 
of learning that has come to be known as probably approximately correct learning, or pac- 
learning for short. As proposed by Valiant, a good learning algorithm should, after viewing a 
certain number of labeled examples, output a hypothesis that with high probability classifies 
almost all unseen examples correctly. Valiant's model has been widely embraced, and many 
interesting classes have been shown to be efficiently learnable under the pac-model. (The 
notion of efficiency is addressed in more detail later). 

Although the pac-model leads to efficient learning algorithms, the most traditional
implementation of the model imposes restrictions on the way learning can be done, and
these restrictions potentially drive up the number of labeled examples (in other words, the
amount of information) the learning algorithm needs to obtain in order to compute a
hypothesis. This thesis asks the following question: can altering several of the pac-learning
restrictions reduce the number of examples the learning algorithm, or equivalently the
learner, needs to obtain in order to compute a hypothesis?

In the standard pac-model, the learner has no choice over what examples are obtained. 
Additionally, the learner knows nothing about the probability distribution under which the 
learner is learning: in particular, any probability distribution is possible. In Chapter 2 we
consider what additional efficiency can be gained in Valiant's model if the learner has some 
control over what examples it receives. We prove a negative result: in spite of having some 
choice over what examples are obtained, the number of examples the learner must obtain 
in order to compute a hypothesis is within a constant factor of the number of examples the 
learner must obtain in the standard pac-model. 

In Chapter 3, we ask whether there is an increase in efficiency if the learner knows ahead 
of time that certain probability distributions (for instance, extremely skewed distributions) 
are disallowed. Here we show a positive result: if the learner knows that certain probability 



distributions are not possible, then giving the learner some control over what examples it 
may obtain can lead to a substantial decrease in the number of examples needed in order 
to compute a hypothesis. 

In Chapter 4, we state conclusions and open problems. 

We continue this chapter by presenting some standard machine learning definitions, an 
introduction to pac-learning, and some new definitions. In light of our definitions and the 
introduction to pac-learning, we then discuss the learning models and main results of this 
thesis. 

1.2 Standard Machine Learning Definitions 

We consider the problem of learning an unknown concept, called the target concept, from
examples. The examples are drawn from a space X, called the sample space. For instance,
we may have X = {0,1}^n or X = R^n. In this thesis we focus on continuous sample spaces,
such as R^n. We use the terms learner and learning algorithm interchangeably.

A concept c is a subset of the sample space X. For x ∈ X, we say that x ∈ c is a
positive instance (or positive example) of a concept c, and that x ∉ c is a negative instance
(or negative example) of a concept c. We also use functional notation, writing c(x) = 1
for positive examples of a concept c and c(x) = 0 for negative examples; we call c(x) the
label or classification of x (by c). The unknown target concept is denoted c*. A hypothesis
concept is denoted ĉ.

For instance, if X = R^n, an example of a concept is a half-space defined by an
(n−1)-dimensional hyperplane. Such a hyperplane splits R^n into two half-spaces. The subset of
R^n consisting of the points on the hyperplane and all points on one side of the hyperplane
(that is, one of the half-spaces) is considered to be the concept. Each of these points is a
positive example of the concept, and each point on the other side of the hyperplane is a
negative example of the concept.

A concept thus is a set of points. Observe that the set, which is potentially infinite,
can, for concept classes of interest here, be represented by a finite set of parameters. For
instance, a half-space is a set with an infinite number of points; it can be represented,
however, by a hyperplane using a finite number of real-valued parameters.

We assume that there is a probability distribution D defined on the sample space X. 
There are then two ways to obtain examples. First, the learner can draw examples from 
X according to probability distribution D. Here we say that the learner is learning from 



random examples. In this case, the learner has no choice over what examples it obtains.
Second, the learner may also, in some cases, make a membership query at an arbitrary
point of X, in which case the learner is also learning from membership queries. Making a
membership query is also called choosing an example. With a membership query, the
learner can ask about any point x, even if x is in a set to which D assigns zero probability.

We assume that the learner is told c*(x) for any example x obtained, so that the learner
knows whether x is a positive example or a negative example of the target concept. We say
that the learner sees examples in a set S if it either draws a random example from S or
makes a membership query in S.

A concept class C is the set of all concepts of a particular type. For instance, the set of all
half-spaces in R^n (each half-space defined by an (n−1)-dimensional hyperplane) is a concept
class. Another concept class is the class I = {[0, a] : 0 ≤ a ≤ 1} of initial segments of [0, 1].
We assume that the learner knows a priori the particular concept class C of which the target
concept c* is a member. However, it is not a requirement that the hypothesis concept that
the learner outputs come from this same concept class C. In fact, our proofs do not depend
on the assumption that the learner outputs some member of C as an approximation to c*.

Given two concepts c and c', let c ⊕ c' denote the set (c − c') ∪ (c' − c) of points that
are classified differently by c and c'. This set of points on which c and c' disagree is called
the symmetric difference of c and c'. Given two concepts c and c', we are not interested in
simply the size of the symmetric difference. Rather, we are interested in the probability of
the symmetric difference.

More formally, we define the distance d_D between concepts c and c' (with respect to
probability distribution D) as

d_D(c, c') = D(c ⊕ c'),

the probability that c(x) ≠ c'(x) for a point x randomly drawn according to D. That is,
d_D(c, c') is the probability that c and c' disagree on the classification of a randomly drawn
point. Another way to think about this is that d_D(c, c') measures the probability associated
with the symmetric difference of c and c'.
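
To make the definition concrete, d_D can be estimated by sampling. The following sketch
(illustrative code added for this edition, not part of the thesis; the two concepts, the uniform
distribution, and the sample size are assumptions chosen for the example) estimates the
distance between two initial segments from the class I:

```python
import random

def estimate_distance(c, c_prime, draw, m=100_000):
    """Monte Carlo estimate of d_D(c, c'): the fraction of m points drawn
    from D that land in the symmetric difference of c and c'."""
    disagree = 0
    for _ in range(m):
        x = draw()                     # one random example from D
        if c(x) != c_prime(x):         # x is classified differently by c and c'
            disagree += 1
    return disagree / m

# Two initial segments [0, a] from the class I; under the uniform
# distribution on [0, 1], d_D(c, c') is exactly |0.55 - 0.50| = 0.05.
c       = lambda x: x <= 0.50
c_prime = lambda x: x <= 0.55
print(estimate_distance(c, c_prime, random.random))
```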

Now, given a target concept c* and another concept c, we wish to measure the distance
d_D(c, c*) between the target concept and c. In this case, distance represents the error (again
in a probabilistic sense) of concept c with respect to target concept c*. More formally, we
define the error rate of a concept c (with respect to the distribution D and the target
concept c*) to be d_D(c, c*). Thus, the error rate of a concept c measures the probability
associated with the symmetric difference of c* and c. We remark that potentially the
symmetric difference of the target concept and c can be a very large set and yet the error
rate d_D(c, c*) can be very small, provided that there is little or no probability associated
with the symmetric difference of c* and c.

Under some probability distributions, it is in fact possible that the symmetric difference
of two concepts c and c' is non-empty, and yet has zero probability. In other words, although
the two concepts do disagree on some set of points, there is no probability weight on these
points. In this case, although the two concepts are not the same (that is, they do not classify
all points the same), the error rate d_D(c, c') equals zero. Thus we note that in general d_D
defines only a bounded pseudometric on C, for any probability distribution D. For d_D to
be a proper metric, it must be the case that for any incorrect concept c there is a positive
probability d_D(c, c*) of drawing a random example that demonstrates that c is incorrect.
If we were to consider only those probability distributions such that c ≠ c' implies that
d_D(c, c') > 0, then d_D would be a bounded metric on C. (In this thesis we do not state the
conditions on measurability that may be required to make our claims go through. See, for
example, Ben-David et al. [6] and Blumer et al. [9] for discussion.)

Often, the hypothesis space of a concept class is uncountably infinite. For instance, the
concept class I has an infinite hypothesis space: there are infinitely many initial segments
[0, a] of [0, 1]. The notion of the VC dimension of a hypothesis space offers a way of
measuring such an infinite space. More specifically, the VC dimension is a combinatorial
parameter of the hypothesis space. Haussler gives a very clear definition of VC dimension
[13]:

"The VC dimension of a hypothesis space H, denoted VCdim(H), is defined
to be the maximum number d of instances that can be labeled as positive and
negative examples in all 2^d possible ways, such that each labeling is consistent
with some hypothesis in H [10] [19]."

In other words, the VC dimension is equal to the largest number m of points, such that for
each element of the power set of the m points, the following holds: there exists a hypothesis
in the concept class, such that each of the points in this particular element of the power set
is classified as positive, and the remaining points (from the original set of m points) which
are not in this element of the power set are classified as negative. For example, the VC
dimension of the class I is 1.
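
As a concrete illustration (a brute-force check written for this exposition; restricting I to a
finite grid of thresholds is an assumption that suffices for these particular points), one point
can be labeled both ways by initial segments, but no pair of points can be labeled in all
four ways, so VCdim(I) = 1:

```python
def shatters(points, concepts):
    """True if every labeling of `points` is realized by some concept."""
    realized = {tuple(c(x) for x in points) for c in concepts}
    return len(realized) == 2 ** len(points)

# Initial segments [0, a]; a finite grid of thresholds suffices here.
concepts = [lambda x, a=a / 100: x <= a for a in range(101)]

print(shatters([0.3], concepts))        # True: a single point is shattered
print(shatters([0.3, 0.7], concepts))   # False: no segment labels 0.3 negative
                                        # and 0.7 positive at the same time
```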



1.3 Pac-learning Definition 

The core idea of pac-learning is that a learning algorithm should compute with high prob- 
ability a hypothesis concept that is accurate to a given error rate. That is, the learner's 
goal in the pac-learning model is not to exactly identify the target concept, but rather to 
compute a hypothesis concept such that with high probability the hypothesis concept agrees 
with the target concept on almost all examples in the sample space. Thus we have Valiant's 
notion of probably approximately correct [13] [18]. 

In the case of "batch" learning (used in the models of this thesis), the learner first sees 
all examples, and then computes the hypothesis concept. In the case of "online" learning, 
the learner may recompute a hypothesis concept after each example seen. In either case, it 
is not sufficient for an algorithm simply to compute and output a hypothesis consistent or 
close to consistent with all examples seen thus far; the learner should have confidence that 
with high probability almost all future examples will be classified correctly. 

1.3.1 Pac-learning error rate 

A pac-learning algorithm is given as part of its input an error parameter ε, which specifies
the maximum error rate acceptable (this relates to being approximately correct), and a
confidence parameter δ, which specifies the degree of certainty that the hypothesis will
classify all unseen examples to within the specified error rate ε (this relates to being probably
approximately correct).

More formally, let ĉ represent the hypothesis concept computed by a run of a pac-learning
algorithm. To pac-learn successfully, the algorithm, given input parameters ε and δ, must,
for all probability distributions D, with probability at least 1 − δ compute a hypothesis ĉ
such that the error rate d_D(c*, ĉ) < ε:

Pr(d_D(c*, ĉ) < ε) ≥ 1 − δ.   (1.1)

In Valiant's original definition, d_D(c*, ĉ) may be less than or equal to ε. For the sake of
simplicity, we assume that the error rate must be strictly less than ε.

We can now define computational complexity in pac-learning. 

1.3.2 Pac-learning complexity 

Valiant's goal was to encourage computer scientists to study complexity in learning theory 
via "precise computational models of the learning phenomenon"[18]. Valiant delineates 



two components of such learning models, which describe frameworks for learning target 
concepts: (1) a protocol for receiving information about the target concept, and (2) an 
algorithm for computing a hypothesis concept based on this information. 

We define sample complexity to be the number of examples the learner needs to see 
to compute a hypothesis concept. Then complexity in learning theory, as proposed by 
Valiant, is measured as a function both of sample complexity (related to (1) above), and of 
computational, or time, complexity, that is, how much computing a learning algorithm must 
do to output a good hypothesis concept (related to (2) above). The goal in pac-learning is 
to use only polynomial sample and time complexity to learn a target concept. This thesis 
focuses in particular on issues related to sample complexity. 

For a pac-learning algorithm to have polynomial sample complexity, it must have sample 
complexity polynomial in 1/ε, 1/δ and the length of the input. For a pac-learning algorithm
to have polynomial time complexity, it must take time polynomial in the number of examples 
received to compute a good hypothesis. 

In sum, in order for a pac-learning algorithm to be computationally efficient as defined 
by the pac-learning model, two conditions need to hold: 

1. The number of examples obtained, that is, the sample complexity, should be polyno- 
mial in 1/ε, 1/δ, and the length of the input.

2. The time taken by the algorithm to compute a good hypothesis, that is, the compu- 
tational complexity, should be polynomial in the number of examples received. 

Henceforth, when we refer to pac-learning, we will be referring to computationally effi- 
cient pac-learning, unless otherwise noted. 

We make an assumption that ε is sufficiently small. It is always true that

1/ε > −1/ln(1 − ε).

For ε < 0.39841, however, it is also true that

1/(4ε) < −1/ln(1 − 2ε).

We thus assume ε is less than 0.39841; this assumption allows us to express our non-
asymptotic lower bounds in the more standard form as a function of 1/ε. For instance,
in Theorem 2 we substitute the weaker lower bound m > (1/(4ε)) ln(1/δ) for the stronger
m > (−1/ln(1 − 2ε)) ln(1/δ), where m is the number of examples obtained.
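
These two inequalities are easy to verify numerically; the following check (added for this
edition, not part of the original text) evaluates both sides on a grid of ε values:

```python
import math

# 1/eps > -1/ln(1 - eps) holds for all eps in (0, 1);
# 1/(4 eps) < -1/ln(1 - 2 eps) holds once eps < 0.39841.
for eps in [0.01, 0.1, 0.2, 0.3, 0.39, 0.398]:
    assert 1 / eps > -1 / math.log(1 - eps)
    assert 1 / (4 * eps) < -1 / math.log(1 - 2 * eps), f"fails at eps={eps}"
print("both inequalities hold on the grid")
```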



1.3.3 Protocols for obtaining examples 

Given the pac-learning model, there are numerous ways in which the learner can obtain 
information about the target concept. 

In the most traditional form of pac-learning (used, for example, by Blumer et al. [8] 
and Valiant [18]), the protocol for receiving information is for the learner to draw random 
examples according to a fixed but unknown probability distribution on the sample space. 
(By unknown probability distribution, we mean unknown to the learner.) When the learner 
requests a sample point, the learner receives back an example drawn according to this 
unknown probability distribution. The example is labeled as either a positive or negative 
example of the concept. Any probability distribution on the sample space is possible. Since 
the complexity bounds must hold no matter what the unknown probability distribution is, 
this type of learning is also called distribution-free learning. 

A second protocol is to obtain information through queries. A model in which the 
learner asks for information is called learning using queries. Although touched upon by 
Valiant [18], learning using queries was left largely unnoticed until Angluin published Types 
of Queries For Concept Learning [1] in 1986. In her paper, Angluin defines a wide variety 
of protocols for receiving information about the target concept. The protocols all have in 
common the fact that they allow the learner more control over what examples are seen. For 
instance, the learner can choose a particular example and ask whether it is in the target 
concept; this is called making a membership query. In general, the goal in learning by query, 
in contrast to pac-learning, is to achieve exact learning, that is, to find the exact target 
concept. 

This thesis looks at models in which information is obtained by both drawing and choos- 
ing examples. 

1.4 New Definitions 

1.4.1 Dense-in-itself concept classes 

We are interested in the sample complexity necessary to pac-learn what we call a dense-in- 
itself concept class C. 

Definition 1 A concept class C is called dense in itself if for all concepts c ∈ C, for all
γ > 0, and for all finite measures μ on X, there is another concept c' ∈ C (that is, c' ≠ c)
such that μ(c ⊕ c') < γ.




A measure is similar to a probability distribution, in that it assigns weight to a sample 
space. However, in a probability distribution all the weight must sum to 1; in a finite 
measure, the weight can sum to any finite value. 

The existence of dense-in-itself concept classes is easily proved using techniques of Ben-
David, Benedek, and Mansour in A Parametrization Scheme For Classifying Models of
Learnability [7]. They first define a notion Cov_D(A, B), such that one set of concepts A
covers another set of concepts B with respect to a given probability distribution D, if for
every ε > 0 and b ∈ B, there exists an element a ∈ A such that d_D(a, b) < ε. That is,
for every b ∈ B, we can always find an a ∈ A such that the probability of the symmetric
difference of a and b is less than ε. Note that as they define it, a and b may be the same
concept.

We modify their definition to handle the case when A and B are the same concept class.
In particular, we adapt their notation, and say that a set C is dense in itself if Cov′_D(C, C),
where Cov′_D(C, C) means that for every ε > 0 and c ∈ C, there exists an element c' ∈ C,
c ≠ c', such that d_D(c, c') < ε. Observe here the extra specification that the two concepts c
and c' cannot be the same. In sum, to make the analogy from their definition of Cov_D(A, B)
to our definition Cov′_D(C, C) of dense-in-itself, we add two conditions: the concept classes
A and B must be the same, and the two concepts a and b must be different.

Ben-David et al. also define the notion Cov_s(A, B). By definition, Cov_s(A, B) holds if
"for every b ∈ B there exists a sequence (a_i : i ∈ N) of elements of A such that
lim_{i→∞} a_i = b" [7]. That is, for every point x, there are only a finite number of concepts in the
infinite a_i sequence that disagree with b on their classification of x. Observe that the notion
Cov_s(A, B) depends solely on the nature of the sets A and B and is independent of any
probability distribution.

We define an analogous notion Cov′_s(C, C), similar to Cov_s(A, B), except again for the
two conditions that the concept classes A and B must be the same, and that each of the
concepts a_i must differ from b.

The notion Cov′_s(C, C) applies to the concept classes that we consider. For instance,
for the concept class of rectangles in the plane, we can think of a sequence of rectangles, each
one containing the one before it, that converges to the target rectangle b. Similarly, we can
think of a sequence of hyperplanes that are all parallel to and increasingly close to a target
hyperplane (where close is measured by perpendicular distance to the hyperplane).

Ben-David et al. prove that Cov_s(A, B) implies Cov_D(A, B). Their proof, essentially
without modifications, also shows that Cov′_s(C, C) implies Cov′_D(C, C). Thus since the
concept classes we consider have the property Cov′_s(C, C), the concept classes we con-
sider have the property Cov′_D(C, C). Recalling that if a concept class has the property
Cov′_D(C, C), then the class is dense in itself, we have that the concept classes we consider
are dense in themselves. Some examples of concept classes that are dense in themselves are
half-spaces in n-dimensional space, rectangles in the plane, and the concept class I.

Recall that, by definition, a dense-in-itself concept class is guaranteed to contain, for
any finite measure μ and concept c, a concept c' ∈ C, c ≠ c', such that the measure μ of the
symmetric difference of c and c' is arbitrarily small. Now the error rate, which is defined
in Section 1.2, is a valid candidate for the measure μ above, since error rate is a finite
measure. Thus, substituting in the above definition the error rate for μ and the target
concept c* for c, we have: for any given dense-in-itself concept class C and arbitrary target
concept c*, we are guaranteed that there exists a concept c' with error rate d_D(c*, c') of
arbitrarily small size. For example, if we are given a target concept c* and wish to ensure
that the error rate of c* and some other concept is less than 10^−6, a dense-in-itself concept
class guarantees us that there exists a concept c' such that d_D(c*, c') < 10^−6.

The following lemma provides a key property of a dense-in-itself concept class. 

Lemma 1 For every algorithm A that pac-learns a dense-in-itself concept class C using
random examples and membership queries, and every ε, δ, target concept c* in C, β > 0,
and probability distribution D, there exists a concept c in C − {c*}, such that

Pr(A(ε, δ) sees examples in c* ⊕ c) < β.

Proof: Let S = c* ⊕ c. Assume A runs under probability distribution D. We prove the
lemma using Definition 1. Our strategy is to show that the probability

Pr(A(ε, δ) sees examples in c* ⊕ c)

is bounded above by the expected number of points that algorithm A(ε, δ) sees in S.
Then, since by the definition of dense in itself we can make the expected number of points
that algorithm A(ε, δ) sees in S as small as we like, it follows that we can make
Pr(A(ε, δ) sees examples in c* ⊕ c) as small as we like too.

We define μ(S) to be the expected number of points that algorithm A(ε, δ) sees in S:

μ(S) = E(number of points A(ε, δ) sees in S).   (1.2)

Then μ is a measure defined on X. Let P_i equal the probability that A(ε, δ) sees i examples
in S. Then

Pr(A(ε, δ) sees any examples in S) = Σ_{i=1}^∞ P_i ≤ Σ_{i=1}^∞ i · P_i = μ(S).

Since by the definition of dense in itself we can find a concept c that allows us to make
μ(S) as small as we like (where S varies according to which c is chosen), the lemma follows.



1.4.2 Smooth probability distributions 

In the original definition of the pac-learning model, all probability distributions are possi- 
ble. One can speculate that the fact that learning must be done under such a model drives 
up sample complexity. For instance, the fact that the fixed but unknown probability dis- 
tribution could be extremely skewed may force the algorithm to obtain far more examples 
than it might have to obtain if it had some information ahead of time about the probability 
distribution. 

We therefore consider whether having some advance knowledge about the nature of the 
probability distribution might, in fact, lower the number of samples the algorithm needs to 
obtain. Thus we look at the sample complexity necessary to pac-learn a concept class C 
under a smooth probability distribution. 

Definition 2 A probability distribution defined on R^n is called smooth if there exists an
s > 0, such that for all a > 0, the probability associated with a region of volume a is ≤ sa.

In other words, for every smooth probability distribution, there is an upper bound on
the amount of probability that can be associated with any region, and this upper bound is
in direct proportion to the size of the region. For instance, for the concept class I, the value
sa is an upper bound on the probability of a region of length a. Similarly, in n-dimensional
space, the value sa is an upper bound on the probability of a region of volume a. Observe
that under a smooth probability assumption, certain highly irregular distributions are not
possible. For instance, there is a limit to how much probability weight can be found in a
very small region, and there can be no probability weight assigned to a point.
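
For a distribution given by a density f, the smoothness condition amounts to the bound
f(x) ≤ s everywhere, since the probability of a region is the integral of f over it, which is
then at most s times its volume. A small illustration (with densities assumed for this
edition, not taken from the thesis):

```python
import math

# A density bounded by s gives a smooth distribution with constant s:
# Pr(region of volume a) = integral of the density <= s * a.
uniform = lambda x: 1.0                                   # smooth, s = 1
tent    = lambda x: 2 * (1 - x) if 0 <= x <= 1 else 0.0   # smooth, s = 2

# By contrast, f(x) = 1 / (2 * sqrt(x)) on (0, 1] integrates to 1 but is
# unbounded near 0: the interval [0, a] has probability sqrt(a), which
# exceeds s * a for small a no matter which finite s is chosen.
for a in [1e-2, 1e-4, 1e-6]:
    print(a, math.sqrt(a) / a)   # the constant s would need to exceed this
```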

1.5 Pac-learning Background 

Some pac-learning history will help the reader view our results in light of past work. 

We recall that pac-learning imposes two complexity requirements: (1) the sample com-
plexity, which is the number of samples drawn, must be polynomial in the length of an
input instance and in the inverse of the error and confidence parameters, and (2) the com-
putational complexity, which is the computational time to find a good hypothesis (with
good defined using the ε and δ parameters), must be polynomial in the number of examples
received.

The first classes shown to be pac-learnable under the distribution-free model were all
classes of Boolean functions [18]. These classes, such as k-DNF and k-CNF, all have finite
sample space, for example, {0,1}^n. In 1986, Blumer et al. extended the notion of what
classes could be learned under the distribution-free pac-learning model to include classes
in Euclidean n-dimensional space, such as half-spaces and hypercubes [8]. Significantly,
concept classes in Euclidean n-dimensional space, in contrast to classes of Boolean functions,
have uncountably infinite sample space: the sample space is R^n. (Observe, however, that a
representation of the concept, for instance, a defining hyperplane, can be specified finitely
with a finite number of real-valued parameters.) Blumer et al. gave upper and lower
polynomial bounds on sample complexity for these classes, and described polynomial-time
algorithms to learn certain Euclidean n-dimensional classes [8].

In fact, a wide variety of concept classes have been shown to be learnable with both
polynomial sample and time complexity. Examples of such classes include monomials over
n Boolean variables [18], and, for constant k, k-DNF [18], k-CNF [18], and k-decision-lists
[15] [17].

To a large extent, however, the time and sample complexity requirements are separable. 
In 1986 Blumer et al. [8] showed that a concept class has polynomial sample complexity if 
the VC dimension of the family of the concept class grows polynomially, and that if the VC 
dimension of a family of concept classes grows faster than polynomially, then the concept 
class is not pac-learnable [8]. 

There have also been significant results in the area of computational complexity. Two 
of the results mentioned here are based on differing definitions of the pac-learning model: 
one in which the hypothesis concept must come from the same concept class as the target 
concept, and one in which the hypothesis concept may come from a different class than the 
target concept. 

In 1986, Pitt and Valiant [16] showed that given a pac-learning model in which the 
output hypothesis concept must come from the same concept class as the target concept, 
the computational complexity to pac-learn certain concept classes is not polynomial. They 
proved that if finding a consistent hypothesis from the same class as the target concept
class is NP-hard, then the given concept class is not pac-learnable unless RP = NP. For
instance, they showed that since finding a consistent 2-term DNF hypothesis for the 2-term
DNF concept class is NP-hard, 2-term DNF is not pac-learnable, assuming RP ≠ NP.

Kearns and Valiant proved computational complexity results for the pac-learning model 
in which the hypothesis concept may come from a different concept class than the target 
concept. They showed that even if the output hypothesis can come from a class outside 
the target concept class, the computational complexity to pac-learn certain concept classes 
is not polynomial. This result depends on certain complexity-theoretic cryptographic as- 
sumptions. Thus Kearns and Valiant showed that certain classes have polynomial sample 
complexity but not polynomial time complexity. Examples of such classes are Boolean 
formulas and finite automata [14]. 

Even when a class is pac-learnable, however, the traditional distribution-free model 
of pac-learning places certain restrictions on the ways in which a learner can learn, and 
these restrictions potentially drive up the sample complexity. For instance, the learner is 
restricted to receiving examples at random according to a fixed but unknown probability 
distribution: the learner has no choice over the particular examples seen. Additionally, the 
upper bounds for sample complexity must hold no matter what probability distribution the 
examples are drawn under: the bounds are worst case and so must hold even for extremely 
unusual distributions. These limitations in the traditional pac-learning model have caused 
researchers to look at ways of varying the model. 

One of the principal means of varying the model is to allow the learner to ask queries, or
questions, of various sorts. Angluin and others have shown that numerous concept classes 
that are not pac-learnable using the traditional model of drawing examples can be exactly 
learned in polynomial time under various query models. An example of such a class is the 
class of regular languages [1] [2]. Examples of classes that the membership-query model 
in particular can be used to learn are monotone DNF, monotone CNF [3] [18], read-once 
Boolean formulas [4], and the union of two half-spaces [5]. In sum, the extra power inherent
in the query model has proven sufficient to learn classes that are not pac-learnable under 
the random example model. 

1.6 The Results of this Thesis 

This thesis examines two variations on the standard distribution-free model. 

The first way we vary the model is to allow the learner to choose examples as well as 
draw random examples. Here, we combine the traditional pac-learning protocol of drawing 
random examples with the membership-query protocol of choosing examples to propose new 
models for learning using queries. 




The second way we vary the model is to give the learner some advance knowledge about 
the probability distribution. We propose a model in which the learner knows a priori that 
the probability distribution is smooth. Since a smooth probability distribution is one that 
guarantees a specified upper bound on the amount of probability that can be associated 
with any region, certain distributions, such as extremely skewed distributions, are ruled 
out. 

There are two key ways our new models differ from other query models. First, research 
on learning using queries has focused on learning classes that do not have polynomial com- 
putational complexity, and so are too hard to learn under the pac-model. This thesis looks 
at classes already known to be pac-learnable, and so which are already known to have 
polynomial sample complexity. The issue addressed then is not whether the given concept 
class has polynomial sample complexity, but rather whether using membership queries in 
conjunction with random examples can reduce the sample complexity of the concept class. 

Second, most research on query models has focused on exactly identifying target concepts 
from classes with finite sample space. This thesis, however, looks at pac-learning classes 
which may have uncountably infinite sample space. So whereas membership queries are 
traditionally used to help exactly identify a concept from a finite class of concepts, we ask 
whether membership queries can help pac-learn (that is, approximately rather than exactly 
learn) a potentially infinite, even uncountable, concept class. 

In Chapter 2, we show that under the distribution-free pac-learning model, the power to 
both choose and draw random examples cannot significantly reduce sample complexity. In 
contrast, in Chapter 3, we show that under a model in which the learner has some limited 
information about the nature of the probability distribution, the power to both choose and 
draw examples does lead to a significant reduction in sample complexity. In Chapter 4, we 
state conclusions and open problems. 

One of our main results thus is that for traditional pac-learning models, membership 
queries do not help: the sample size required to pac-learn a dense-in-itself concept class if 
the learner can both choose and draw examples is within a constant factor of the sample size 
needed if the learner can just draw examples. Significantly, this lower bound is established 
for all probability distributions (as long as the probability distribution is unknown to the 
learner), not just for one worst-case probability distribution. On the other hand, we get 
significant results in favor of membership queries if the learner knows a priori that the 
probability distribution is smooth: if the learner can only draw examples, then the lower 
bound on sample complexity for pac-learning many concept classes does not improve, but 
if the learner can both choose and draw examples, then the sample and computational
complexity improve significantly.



Chapter 2 



A Lower Bound When Both Drawing and 
Choosing Examples 



Up until now, when we have referred to pac-learning, we have meant the traditional model 
of pac-learning, in which the protocol for obtaining information about the sample space is 
to draw random examples. Henceforth, we consider any number of protocols to be viable 
ways of obtaining examples when pac-learning. In particular, we consider protocols that use 
random examples alone, membership queries alone, or a combination of both. We always 
specify the protocol being used. 

2.1 A Lower Bound 

Theorem 1 gives a lower bound on the number of examples required to pac-learn a dense- 
in-itself concept class C by any algorithm that can both draw random examples and make 
membership queries. 

Theorem 1 For every algorithm A that pac-learns a dense-in-itself concept class C using
random examples and membership queries, and for every ε, δ > 0, probability distribution
D, and target concept c* ∈ C, the total number of examples A needs to pac-learn C is
Ω((1/ε) ln(1/δ)).

Note that A may be either randomized or deterministic, and it may draw examples
and make membership queries in any order. The bound is on the worst-case number of
examples seen, where the worst case is taken over the runs of the algorithm for the given
target concept, probability distribution, ε, and δ. We assume in our asymptotic notation
that ε, δ, and β (to be defined shortly) are going to zero in the limit.

Proof: We begin by assuming that a pac-learning algorithm A, an ε, a δ, a probability
distribution D, and a target concept c* have been given. We claim that we, as adversary,
can produce a probability distribution D' and a concept c'* that are close enough to D and
c*, respectively, that if A does not draw Ω((1/ε) ln(1/δ)) random examples to distinguish
these cases, then Pr(d_D'(c'*, ĉ) ≥ ε) > δ, so that A is not a pac-learning algorithm.

Let β be an arbitrarily small positive number, and let c'* be a concept different from c*
such that

Pr(A sees examples in c* ⊕ c'* | A draws according to D) < β.

The existence of c'* is guaranteed by Lemma 1. Then

Pr(A sees no examples in c* ⊕ c'* | A draws according to D) ≥ 1 − β.

Note that if A sees no examples in c* ⊕ c'*, then A cannot distinguish between target concept
c* and target concept c'*.

We now choose any point w in c* ⊕ c'*. Define the probability distribution D_w to be the
distribution that assigns probability 1 to point w and probability 0 to all other points.

Given probability distributions D and D_w, we make a new probability distribution D' as
follows. We "skim" ε probability off of D, while keeping the relative distribution of points
in D the same. That is, we uniformly remove ε probability from the probability distribution
D such that the relative weight of points under D remains the same, but the total weight
of points in D now adds up to 1 − ε. We assign the excess ε probability to w. Thus our
new probability distribution D' is a linear combination of probability distribution D and
probability distribution D_w:

D' = (1 − ε)D + εD_w.

From now on, we suppose that algorithm A is drawing its random examples according
to D'. When drawing an example according to D', there are two possibilities: either the
example is drawn "as if according to D" (this happens with probability 1 − ε), or the example
is drawn "as if according to D_w" (this happens with probability ε). In general, we say that
"A draws as if according to D" if A draws all of its random examples as if according to D.

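The case split is easy to picture in code. The sketch below (an illustration added for this
edition; the base distribution, ε, and w are arbitrary assumptions, and the thesis itself
argues abstractly) draws from D' by flipping an ε-biased coin per example:

```python
import random

def draw_from_D_prime(draw_from_D, w, eps):
    """One sample from D' = (1 - eps) * D + eps * D_w."""
    if random.random() < eps:
        return w                 # drawn "as if according to D_w"
    return draw_from_D()         # drawn "as if according to D"

# With m independent draws, every example is drawn "as if according to D"
# with probability (1 - eps) ** m, the quantity bounded in the proof.
eps, m, w = 0.05, 20, 0.123
sample = [draw_from_D_prime(random.random, w, eps) for _ in range(m)]
print(sample.count(w), "draws hit w;", (1 - eps) ** m, "= Pr(none do)")
```
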
Suppose that we run algorithm A on distribution D' and target concept c*, that A
draws all of its examples as if according to distribution D, and that A sees no examples
in c* ⊕ c'*. By definition, a pac-learning algorithm with high probability correctly classifies
any point with weight ε or greater. Because A is a pac-learning algorithm and w has weight
ε, A's output hypothesis ĉ classifies w correctly with high probability. Thus ĉ with high
probability agrees with c*(w), and so disagrees with c'*(w).

We note that, on the one hand, c'* is consistent with target concept c* on all examples
algorithm A sees with probability at least 1 − β, yet, on the other hand, c'* disagrees with
c* on the classification of w. Since A is a pac-learning algorithm, therefore, when A now
tries to learn c'* instead of c* on distribution D', we have that its output concept ĉ satisfies

Pr(ĉ(w) ≠ c'*(w) | A draws as if according to D) ≥ 1 − δ − β,   (2.1)

since we are assuming that A pac-learns c* on D correctly, and with probability 1 − β the
run looks the same as it would if c* were the target concept. Observe that if A runs under
probability distribution D', then saying ĉ(w) ≠ c'*(w) is equivalent to saying d_D'(ĉ, c'*) ≥ ε.

Let m = m(ε, δ) denote the maximum possible total number of examples drawn and/or
chosen by any run of A on inputs ε and δ, and any probability distribution and any target
concept. Since random examples are independent events, if all m examples obtained are
random examples, then

Pr(A draws as if according to D) = (1 − ε)^m.

If fewer than m examples are random examples, then

Pr(A draws as if according to D) ≥ (1 − ε)^m.

Assume that m is so small that it satisfies

(1 − ε)^m ≥ δ/(1 − δ − β);

then

Pr(A draws as if according to D) ≥ δ/(1 − δ − β).

We show that this assumption implies that A does not pac-learn.

Assume that in a given run A receives information about m sample points, by random
example according to probability distribution D', by membership query, or by both. What
then is the probability that during a run of A (on input distribution D', with target concept
c'*) the following events both happen?

E = A misclassifies w (and so makes error at least ε).

F = A draws all its random examples as if according to D.

Now,

Pr(EF) = Pr(E | F) Pr(F).

Thus, since (by (2.1))

Pr(E | F) = Pr(A misclassifies w | A draws as if according to D) ≥ 1 − δ − β,

and

Pr(F) = Pr(A draws as if according to D) ≥ (1 − ε)^m ≥ δ/(1 − δ − β),

we have

Pr(EF) ≥ (1 − δ − β) · δ/(1 − δ − β) = δ.

Since Pr(E) ≥ Pr(EF) ≥ δ, we have Pr(d_D'(c'*, ĉ) ≥ ε) ≥ δ, and therefore A has not
pac-learned.

Solving for m in (1 − ε)^m ≥ δ/(1 − δ − β), we have

m ln(1 − ε) ≥ ln(δ) − ln(1 − δ − β),

so

m ≤ (1/ln(1/(1 − ε))) (ln(1/δ) + ln(1 − δ − β)).

Thus, unless

m > (1/ln(1/(1 − ε))) (ln(1/δ) + ln(1 − δ − β)),

in other words, unless m = Ω((1/ε) ln(1/δ)), we have Pr(d_D'(c'*, ĉ) ≥ ε) ≥ δ, which proves
our theorem.



2.2 Comparison of the Lower Bound to Previous Results 

We already know from Ehrenfeucht et al. [12] that for ε < 1/2, an algorithm that can only
draw random examples must see at least

Ω((1/ε) ln(1/δ) + VCdim(C)/ε)

examples to pac-learn, where VCdim(C) is the Vapnik-Chervonenkis dimension of the class
C. Since our bound is within a constant factor of their lower bound (treating VCdim(C)
as a constant as we let ε and δ go to zero), we have shown that the minimum number
of examples needed by an algorithm A that can both draw random examples and make
membership queries is within a constant factor of the minimum number of examples needed
by an algorithm B that can only draw random examples.






We remark that, compared to the proof given by Ehrenfeucht, Haussler, Kearns, and 
Valiant [12], our proof has the advantage that it applies to every probability distribution (not 
just one constructed by the adversary), with the understanding of course that the learning 
algorithm has no prior knowledge about the actual probability distribution, and that the 
learning algorithm must pac-learn for every distribution. It also applies to learning from 
both random examples and membership queries (not just learning from random examples). 
On the other hand, our proof does not yield a lower bound that grows with the VC dimension 
of the concept class C. 






Chapter 3 

Smoothness Guarantees 



Until now, we have assumed that any probability distribution was possible. Now suppose 
we have a "smoothness guarantee." Specifically, we assume that the learning algorithm has 
an additional input, s, and is told that any region of X having volume a, for any a, has 
probability at most sa. Can this assumption improve our lower bound? Our first claim 
is that if the learning algorithm uses random examples only, then a smoothness guarantee 
does not reduce the lower bound on the number of random examples required to pac-learn. 

We say (C, D) (where C is a concept class on a domain X and D is a probability
distribution on X) has all error rates if for every target concept c* ∈ C, and for every
nonnegative ε < sup_{c∈C} {d_D(c*, c)}, there is a concept c ∈ C such that d_D(c*, c) = ε. For
instance, if D is smooth, and C is the concept class of half-spaces of R^n, rectangles in the
plane, or I, then (C, D) has all error rates.

3.1 A Lower Bound When Drawing Examples Only 

Theorem 2 For every algorithm B that draws random examples to pac-learn C, sufficiently
small ε > 0, δ > 0, and smooth probability distribution D, if (C, D) has all error rates then
the number of examples B needs to draw is greater than (1/(4ε)) ln(1/δ).

Thus, even though an algorithm B has information that D is smooth, if C is a concept
class such that (C, D) has all error rates, B still needs to draw (1/(4ε)) ln(1/δ) examples to
pac-learn C. Note that while the assumption that a concept class C under distribution D
has all error rates may seem restrictive, as a practical matter it is quite reasonable, given
the assumption of a smooth probability distribution. We leave it as an open problem to see
whether this assumption actually is a restriction.

Proof:

Let c be a concept such that d_D(c*, c) = 2ε. (We have specified in the statement of The-
orem 2 that ε must be sufficiently small. To be more precise, if ε < (1/2) sup_{c∈C} {d_D(c*, c)},
then by the assumption that (C, D) has all error rates, there will always be a concept c with
error rate 2ε. Adding this constraint to our earlier constraint in Section 1.3.2 that ε must
be less than 0.39841, we have ε < min(0.39841, (1/2) sup_{c∈C} {d_D(c*, c)}).)

Let S = c* ⊕ c. As before, let ĉ be the hypothesis concept that a run of algorithm B
outputs, and let c'* be the concept that we as adversary choose.

We begin by showing that if B sees no points in S, an adversary can force B to have
error rate at least ε. We conclude by showing that unless B draws at least (1/(4ε)) ln(1/δ)
examples, then with probability greater than δ the algorithm will see no points in S; in
other words, with probability greater than δ the algorithm will have error rate at least ε.

Given that B sees no points in S, we assume without loss of generality that ĉ is more
likely to agree with c* (in the probabilistic sense) than with c on points in S. Then

Pr(ĉ(x) = c*(x) : x ∈ S) ≥ 1/2,

where the probability is based on the conditional distribution induced on S by distribu-
tion D.

In this case, we set c'* to c. (If ĉ is more likely to agree with c, we set c'* to c*; the
reasoning for the rest of the proof remains the same.) Thus

Pr(ĉ(x) ≠ c'*(x) : x ∈ S) ≥ 1/2,

where the probability again is based on the conditional distribution induced on S by distri-
bution D.

Since d_D is a bounded pseudometric, the triangle inequality holds for d_D, giving

d_D(c*, ĉ) + d_D(ĉ, c'*) ≥ d_D(c*, c'*).   (3.1)

Now, by assumption,

d_D(ĉ, c'*) ≥ d_D(c*, ĉ).   (3.2)

By (3.1) and (3.2), we have

2 d_D(ĉ, c'*) ≥ d_D(c*, c'*), or d_D(ĉ, c'*) ≥ d_D(c*, c'*)/2.   (3.3)

Also, by assumption, we have

d_D(c*, c) = 2ε.   (3.4)

Since we have set c'* to c, (3.4) is equivalent to

d_D(c*, c'*) = 2ε.   (3.5)

Thus, combining (3.3) and (3.5), we obtain

d_D(ĉ, c'*) ≥ 2ε/2 = ε,

which shows that if B sees no points in S, B has error rate at least ε.

The probability that B independently draws m examples and fails to see a point in S,
that is, the probability that B has error rate at least ε, is (1 − 2ε)^m. We fail to pac-learn
if the probability of error rate at least ε is greater than δ, that is, if (1 − 2ε)^m > δ; given
our assumption that ε < 0.39841, we thus have that the lower bound on m even with the
smoothness guarantee is (1/(4ε)) ln(1/δ). ■
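
To spell out the arithmetic of this last step (a worked derivation added for clarity; it uses
only the inequality established in Section 1.3.2):

(1 − 2ε)^m > δ  ⟺  m ln(1/(1 − 2ε)) < ln(1/δ)  ⟺  m < ln(1/δ) / ln(1/(1 − 2ε)).

Since ε < 0.39841 gives ln(1/(1 − 2ε)) < 4ε, every m ≤ (1/(4ε)) ln(1/δ) lies in this failure
range; hence pac-learning forces m > (1/(4ε)) ln(1/δ).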

3.2 Upper and Lower Bounds When Both Drawing and Choos- 
ing Examples 

So far in this chapter, we have shown that a smoothness guarantee does not significantly 
reduce the lower bound on sample complexity if the learning algorithm can only draw 
random examples. We now show, on the other hand, that if the learning algorithm can also 
make membership queries, then there exists a fast pac-learning algorithm whose sample 
and computational complexity upper bounds are well below the lower bounds proved in the 
previous section. We first prove this result for a one-dimensional concept class, and then 
extend the result to an n-dimensional concept class. (Our generalization to n dimensions is 
not completely general; we prove the result only for a specific concept class. It is an open 
problem to prove such a result for general bounded n-dimensional domains.) 

3.2.1 Using the binary-search approach in one dimension 

Theorem 3 There exists an algorithm A that pac-learns I using membership queries, such
that for every ε, δ, and smooth probability distribution D with known smoothness constant
s, the number of examples A needs to pac-learn I is lg(s/ε). For deterministic algorithms
lg(s/ε) is also a lower bound on the sample complexity. Additionally, if we assume unit cost
for each example drawn or chosen, a deterministic algorithm A runs in time O(lg(s/ε)).




Note that although under this model the algorithm may both draw and choose examples, 
the algorithm we present only chooses examples. 

Proof: The proof is based on a "binary search." We call a hypothesis interval an interval
defined by two points, which we call L and R. Here L is the value of the rightmost positive
example seen so far, and R is the value of the leftmost negative example seen so far. Thus
the interval [L, R] represents the points that we don't know how to classify yet; the right
endpoint of c*, our target concept, is somewhere within this interval.

In order to guarantee that the probability sa associated with a region of volume a is
less than ε, we set up the following equation, which gives us the requisite size for a:

sa < ε, or a < ε/s.

So if A has narrowed [L, R] to a size less than ε/s, A is guaranteed that the probability
associated with [L, R] is less than (ε/s) · s, or ε, and therefore can output any ĉ with right
endpoint in [L, R] with total confidence (δ = 0) that the error rate is less than ε. Algorithm
A can use membership queries to do a binary search on the initial interval [0, 1], in which
case A needs to make only lg(s/ε) membership queries in order to narrow [L, R] to a size
less than ε/s. Observe that this upper bound of lg(s/ε) on the sample complexity when
using membership queries is substantially lower than the lower bound of (1/(4ε)) ln(1/δ) on
the sample complexity when using random examples only.
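
The upper-bound algorithm is short enough to state in full. The following sketch (an
illustrative rendering of the binary search just described; the membership oracle, s, ε, and
the target endpoint are assumptions chosen for the example) learns an initial segment [0, a]
with error below ε under any s-smooth distribution:

```python
import math

def learn_initial_segment(query, s, eps):
    """Binary search for the right endpoint of a target [0, a] in I.
    `query(x)` is a membership oracle returning True iff x <= a.
    Stops once [L, R] is shorter than eps / s, so the unknown region
    has probability < (eps / s) * s = eps under any s-smooth D."""
    L, R = 0.0, 1.0
    queries = 0
    while R - L >= eps / s:
        mid = (L + R) / 2
        if query(mid):               # mid is a positive example
            L = mid
        else:                        # mid is a negative example
            R = mid
        queries += 1
    return L, queries                # any endpoint in [L, R] works

s, eps, a = 4.0, 0.01, 0.6180
h, used = learn_initial_segment(lambda x: x <= a, s, eps)
print(abs(h - a) < eps / s, used, "queries; bound:",
      math.ceil(math.log2(s / eps)))
```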

To see that lg(s/ε) is also a lower bound on the sample complexity for deterministic
algorithms, consider an adversarial argument. An adversary can always label the requested
examples so that after the algorithm makes lg(s/(2ε)) queries, an interval [L, R] of size 2ε/s
remains. At this point, the adversary can always force the algorithm to output a hypothesis
with error rate at least ε. For instance, even if the algorithm outputs the concept represented
by the midpoint of the interval, the adversary can choose the concept represented by L.
Then, because the interval between L and the midpoint has length ε/s, in the worst case
the hypothesis will have error rate s · (ε/s) = ε. Since lg(s/(2ε)) = lg(s/ε) − 1, we have that
greater than lg(s/ε) − 1 queries are needed to ensure that in all cases the error is less than
ε; thus, at least lg(s/ε) queries are necessary.

If we assume it takes constant time for each membership query, then we also achieve the
tight bound of O(lg(s/ε)) on the computational complexity for deterministic algorithms.






3.2.2 Generalizing the binary-search approach to n dimensions 

So far, we have given an algorithm which pac-learns the one- dimensional class T under 
the assumption of a smooth probability distribution. In short, such an algorithm pac-learns 
"half-spaces" of a one-dimensional line segment. We would also like to show how to pac-learn 
an analogous 7i-dimensional class. That is, we would like to pac-learn the set of half-spaces 
of an n-dimensional object, where the n-dimensional object is a logical generalization of the 
one-dimensional line segment. 

There are a number of objects that come to mind as logical generalizations of the one- 
dimensional line segment. Two such generalizations are the hypercube and the hypersphere. 
But perhaps the simplest generalization of a one- dimensional segment is the simplex. We 
will generalize the binary-search approach of the above section to learn half-spaces of a 
simplex. 

A simplex is a finite region of n-space enclosed by n + 1 hyperplanes, where each hy-
perplane has dimension n − 1. Importantly, a simplex has only n + 1 corner points, n + 1
faces, and (n+1 choose 2) edges. Each corner point is connected to each other corner point by an
edge. For instance, a simplex in one dimension is a line, which is enclosed by two points
(a point has dimension zero). A simplex in two dimensions is a triangle, which is enclosed
by three lines. A simplex in three dimensions is a tetrahedron, which is enclosed by four
planes, and so on [11]. A regular simplex is one in which each edge has the same length.

A simplex thus has certain properties that make it easier to use as the generalization of
a line segment than a hypercube or a hypersphere. For instance, observe that the number
of faces and corner points of a simplex is linear in the dimension, and the number of edges
is only quadratic. This contrasts sharply with a hypercube, for example, in which the
number of edges and corner points is exponential in the dimension. Our algorithm
capitalizes on, and in fact depends on, these properties of the simplex. We leave it as an
open problem to pac-learn the set of hyperplanes which intersect a hypercube or a
hypersphere.

Let n represent the dimension of the concept class. 

Theorem 4 Let S be the class of hyperplanes which intersect a regular simplex. There
exists an algorithm A that pac-learns S using membership queries, such that for every ε, δ,
and smooth probability distribution D with known smoothness constant s, the number of
examples A needs to pac-learn S is n^2(lg(s/ε) + 4). Additionally, if we assume unit cost for
each example drawn or chosen, A runs in time O(n^7 s/ε).

Proof: 




We consider a regular simplex in which each edge has length 1, and we assume that the 
coordinates of the simplex corners are given to the learning algorithm. The domain X of 
S consists of exactly those points on the boundary of the simplex and within the simplex. 
Given a target hyperplane that intersects a simplex, we consider the target hyperplane and 
all domain points on one side of the hyperplane to be positive examples of the concept, 
and all domain points on the other side of the hyperplane to be negative examples of the 
concept. Thus the concept class S is represented by the set of (n − 1)-dimensional hyperplanes
that intersect the given simplex of n dimensions.
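
For concreteness (an illustration of ours, not part of the thesis), explicit corner coordinates for a unit-edge regular n-simplex are easy to supply: scale the standard basis of R^{n+1} by 1/sqrt(2), so that every pair of corners lies at distance exactly 1:

    import itertools, math

    def regular_simplex_corners(n):
        """Corners of a regular n-simplex with edge length 1, embedded in
        R^{n+1} as the scaled standard basis vectors e_i / sqrt(2)."""
        return [[1 / math.sqrt(2) if j == i else 0.0 for j in range(n + 1)]
                for i in range(n + 1)]

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    corners = regular_simplex_corners(4)
    # Every pair of corners is joined by an edge: (n+1 choose 2) pairs, length 1.
    for p, q in itertools.combinations(corners, 2):
        assert abs(dist(p, q) - 1.0) < 1e-12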

We begin our analysis by giving the formulas for computing the height and volume of a 
regular simplex whose edges have length 1. 

The recurrence formula for the height of such an n-dimensional simplex is

H_{n+1} = sqrt(1 − (H_n · n/(n + 1))^2),

with base case H_1 = 1. This evaluates to

H_n = sqrt((n + 1)/(2n)).

The volume V_n of an n-dimensional simplex of edge-length 1 is equal to the base of the
simplex (B_n) times the height of the simplex (H_n) divided by the dimension n:

V_n = B_n · H_n / n.

Since the base B_n of an n-dimensional simplex is equal to the volume V_{n−1} of the
corresponding (n − 1)-dimensional simplex, the recurrence formula for V_n is

V_n = V_{n−1} · H_n / n.

This evaluates to

V_n = sqrt((n + 1)/2^n) / n!.



The surface area S_n of an n-dimensional simplex is equal to the number of faces of an
n-dimensional simplex (n + 1) times the volume V_{n−1} of an (n − 1)-dimensional simplex:

S_n = (n + 1) · V_{n−1} = (n + 1) · sqrt(n/2^{n−1}) / (n − 1)!.
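
These closed forms can be checked numerically against the recurrences; the sketch below (ours, under the unit-edge conventions above) confirms agreement for small n:

    from math import sqrt, factorial

    def H(n):  # height of a regular n-simplex with unit edges (closed form)
        return sqrt((n + 1) / (2 * n))

    def V(n):  # volume (closed form)
        return sqrt((n + 1) / 2 ** n) / factorial(n)

    def S(n):  # surface area: (n + 1) faces, each an (n-1)-simplex
        return (n + 1) * V(n - 1)

    # Check the closed forms against the recurrences, starting from
    # H_1 = V_1 = 1 (a unit segment).
    h, v = 1.0, 1.0
    for n in range(1, 20):
        assert abs(h - H(n)) < 1e-12 and abs(v - V(n)) < 1e-12
        h = sqrt(1 - (h * n / (n + 1)) ** 2)   # H_{n+1}
        v = v * h / (n + 1)                    # V_{n+1} = V_n * H_{n+1} / (n+1)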



For any target hyperplane which cuts the simplex, the learning algorithm A should
output a hypothesis hyperplane such that the error rate of the hypothesis is less than ε.

Consider first the classification of each of the simplex corner points. In the trivial case 
where all n + 1 corner points are labeled positive, all points inside the simplex are also 
positive. In this case any hypothesis which classifies all simplex points as positive suffices. 
(Similarly if all n + 1 corner points are labeled negative.) Otherwise, there is at least one 
positive corner point and one negative corner point. 

Let k be the total number of positive corner points, and let l be the total number of
negative corner points. Then k + l = n + 1, the total number of corner points. Since each
corner point is connected to each other corner point, the total number of "mixed" edges,
that is, edges such that one endpoint is positive and one is negative, is kl.

The separating target hyperplane must cut each of these kl mixed edges. In fact, the 
set of simplex edges that the target hyperplane cuts is precisely this set of kl mixed edges. 
(All other edges have both endpoints positive or both endpoints negative, and clearly the 
separating hyperplane cannot cut these edges.) 
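As a quick illustration (a sketch of ours), the mixed edges can be read off directly from the corner labels:

    from itertools import combinations

    def mixed_edges(labels):
        """Given corner labels (True = positive) of an n-simplex, return the
        edges with one positive and one negative endpoint; with k positives
        and l negatives there are exactly k * l of them."""
        return [(i, j) for i, j in combinations(range(len(labels)), 2)
                if labels[i] != labels[j]]

    labels = [True, True, False, True, False]    # n = 4, so k = 3 and l = 2
    assert len(mixed_edges(labels)) == 3 * 2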

By definition, the length of each edge of the simplex is 1. We wish to find a much smaller 
interval, which we call a cut-interval, on each mixed edge. A cut-interval is a sub-interval of 
a mixed edge that still retains the property that one endpoint is positive and one endpoint is 
negative. Since a mixed edge is one-dimensional, we can apply the binary-search techniques
of the previous section to find a cut-interval of a specified size on the mixed edge.

Note that only n points are needed to determine an (n − 1)-dimensional hyperplane. By doing a
binary search along each of the kl mixed edges, we find kl cut-intervals (rather than points).
Since kl is always at least n, these kl cut-intervals define a constrained n-dimensional region
within which the hyperplane must lie. (By the definition of a simplex, and the fact that each
of the kl cut-intervals comes from a different edge of the simplex, we are guaranteed that
the kl cut-intervals will indeed define an n-dimensional region.)

The overall strategy of the algorithm, then, is to find a cut-interval of a predetermined 
size, which we call width, on each of the kl mixed edges, and then output any hypothesis 
consistent with each of the cut-interval constraints. 

We present the algorithm, give a derivation for the value of the key variable width,
prove the correctness of the algorithm, and finally give the sample complexity (number
of membership queries needed) and computational complexity (time to find a hypothesis
consistent with the constraints) analysis.

Figure 3.1: Simplex with pancake region within the dashed lines; w = width.

3.2.2.1 The algorithm 

The algorithm begins by making n + 1 membership queries to obtain the classification of 
each corner point. 

For each of the resulting kl mixed edges, the algorithm does a binary search using 
membership queries to find a cut-interval of size width (width is precisely defined shortly). 
Since the length of each simplex edge is 1, the maximum number of queries necessary to do
each binary search is lg(1/width).

Each of the two endpoints of each cut-interval determines a constraint on where the 
target hyperplane can lie, so there are 2kl constraints total. Any hypothesis consistent with 
the constraints is output. 
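
A compact sketch of this query phase (our own rendering in Python; member is a hypothetical membership oracle over points of the simplex, and the function names are ours) classifies the corners and then binary-searches each mixed edge:

    from itertools import combinations

    def lerp(p, q, t):
        """The point a fraction t of the way from corner p to corner q."""
        return [a + t * (b - a) for a, b in zip(p, q)]

    def query_phase(corners, member, width):
        """Return one (point, point) cut-interval of length <= width per
        mixed edge, using n + 1 corner queries plus at most lg(1/width)
        queries per mixed edge."""
        labels = [member(c) for c in corners]
        cuts = []
        for i, j in combinations(range(len(corners)), 2):
            if labels[i] == labels[j]:
                continue                          # not a mixed edge
            lo, hi = 0.0, 1.0                     # parametrize edge i -> j
            while hi - lo > width:
                mid = (lo + hi) / 2
                if member(lerp(corners[i], corners[j], mid)) == labels[i]:
                    lo = mid
                else:
                    hi = mid
            cuts.append((lerp(corners[i], corners[j], lo),
                         lerp(corners[i], corners[j], hi)))
        return cuts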

3.2.2.2 Derivation of width 

In order to analyze the sample and computational complexity, we need to determine the
size of width necessary to achieve an error rate less than ε.

We form a "pancake" as follows: consider the n — 1-dimensional target hyperplane and 
a one-dimensional line perpendicular to this hyperplane. If we move the target hyperplane 
along the perpendicular a distance of width in each of the two directions away from the 
target hyperplane, then we trace out a pancake-like region with total width 2 • width. (See 
Figure 3.1.) 

We claim that the only points on which the target hyperplane and any hypothesis
consistent with all 2kl constraints can disagree must lie in this n-dimensional pancake that






encloses the target hyperplane. We thus wish to choose a value of width so that the entire
pancake has total probability less than ε.

The volume of our pancake is clearly less than or equal to the width of the pancake times
the volume of the maximum cross section of the pancake. The volume of the maximum
cross section of the pancake in turn is less than or equal to the volume of the maximum
cross section of the simplex, which in turn is less than or equal to the surface area S_n of the
simplex. (We conjectured that the maximum cross section of a simplex is at most equal to
the surface area of a single face of the simplex, but could not prove this conjecture.) Thus
the volume of the pancake must be less than or equal to the width of the pancake, which is
2 · width, times the surface area of the simplex, which is S_n.

We wish that the total volume of the pancake should have probability weight less than
ε. So given smoothness constant s, it suffices to have

s · (2 · width · S_n) < ε.    (3.6)

This is equivalent to

s · 2 · width · (n + 1) · V_{n−1} < ε,

and to

s · 2 · width · (n + 1) · sqrt(n/2^{n−1}) / (n − 1)! < ε,

yielding (with rearrangements of terms)

width < (ε/s) · (1/2) · sqrt(2^{n−1}/n) · (n − 1)!/(n + 1).    (3.7)

Any value of width which satisfies (3.7) will also satisfy (3.6), and any value of width
which satisfies (3.6) will ensure that the total volume of the pancake has probability weight
less than ε. So we choose width equal to one half the right-hand side of (3.7),

width = (1/2) · (ε/s) · (1/2) · sqrt(2^{n−1}/n) · (n − 1)!/(n + 1)

      = (ε/s) · (1/4) · sqrt(2^{n−1}/n) · (n − 1)!/(n + 1),

and then analyze the resulting number of membership queries required given this choice
of width.

The number of membership queries necessary to narrow a cut-interval to size width is
lg(1/width). Here, we have

lg(1/width) = lg(s/ε) + lg 4 + lg(sqrt(n/2^{n−1}) · (n + 1)/(n − 1)!).




Since

lg(sqrt(n/2^{n−1}) · (n + 1)/(n − 1)!) < 1.6

for all n (and in fact is less than 0 for n > 3), and

lg 4 = 2,

we have that lg(s/ε) + 4 is an upper bound on lg(1/width):

lg(1/width) < lg(s/ε) + 4.

So the number of membership queries necessary on each mixed edge is at most lg(s/ε) + 4.
Thus, we must do a binary search making at most lg(s/ε) + 4 membership queries on each
of the kl mixed edges. Since kl ≤ n^2 (1 ≤ k ≤ n and 1 ≤ l ≤ n, so kl ≤ n^2), the total sample
complexity is at most n^2 (lg(s/ε) + 4).
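
The constant 1.6, and the claim that the extra term goes negative past n = 3, are easy to verify numerically (a check of ours; the term peaks at n = 2, where it equals lg 3, about 1.585):

    from math import sqrt, factorial, log2

    def extra_bits(n):
        # lg( sqrt(n / 2^(n-1)) * (n + 1) / (n - 1)! )
        return log2(sqrt(n / 2 ** (n - 1)) * (n + 1) / factorial(n - 1))

    assert all(extra_bits(n) < 1.6 for n in range(1, 60))
    assert all(extra_bits(n) < 0 for n in range(4, 60))
    assert abs(extra_bits(2) - log2(3)) < 1e-12   # the maximum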

3.2.2.3 Correctness of the algorithm 

We claim that the maximum error rate of any hypothesis hyperplane consistent with the 2kl
constraints is less than ε. We prove this by showing that all the points on which the target
and hypothesis hyperplanes disagree, that is, the symmetric difference of the target and
hypothesis concepts, must lie in the n-dimensional pancake region of width 2 · width that
surrounds the target hyperplane. Since this pancake region has total probability weight less
than ε, the error rate of any consistent hypothesis must be less than ε.

We first show that each cut-interval endpoint lies within the pancake region, and then 
show that all points in the symmetric difference of the target and hypothesis concepts lie 
within the pancake. 

The total length of each cut-interval is width, and the target hyperplane crosses each
cut-interval (since the interval's endpoints have opposite labels). This implies that the
projection of each cut-interval endpoint onto the line perpendicular to the target hyperplane
must have distance to the target hyperplane less than or equal to width. Since the pancake
is created by moving width along the perpendicular to the target hyperplane in each of the
two directions away from the target hyperplane, each endpoint thus must lie within the
pancake.

We now establish that any point (not just the endpoints) on which the target and 
hypothesis concepts disagree lies within the specified pancake region. 

Consider the convex hull created by connecting the 2kl cut-interval endpoints. (We
remark that not all of the (2kl choose 2) edges formed by connecting endpoints define the convex
hull; some of these edges lie inside the convex hull.) The target and hypothesis hyper-
planes lie entirely within this convex hull, since each point where the target and hypothesis
hyperplanes intersect an edge of the simplex is within a cut-interval. Thus, any point in the
symmetric difference of the hypothesis and target concepts also lies within the convex hull.
Now we establish that every point within the convex hull lies within the pancake region.

As already stated, the distance from the target hyperplane to the projection of each of
the 2kl cut-interval endpoints onto the target hyperplane perpendicular must be less than
or equal to width. Since any point within the convex hull is a convex combination of
cut-interval endpoints, any point in the convex hull must also have distance
to the target hyperplane less than or equal to width. But then every point in the convex
hull, which in particular includes all points in the symmetric difference of the target and
hypothesis concepts, must be contained within the pancake region. Thus, since the pancake
has total probability weight less than ε, the hypothesis error rate must be less than ε.
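
The convexity step is one line of algebra: if each endpoint p_i has signed perpendicular distance d_i to the target hyperplane with |d_i| ≤ width, then any convex combination sum λ_i p_i has signed distance sum λ_i d_i, whose absolute value is at most width. A small numerical sketch of ours (the names and the example hyperplane are illustrative only):

    import random

    def signed_dist(x, w, b):
        """Signed distance from x to the hyperplane {y : w . y = b}, ||w|| = 1."""
        return sum(wi * xi for wi, xi in zip(w, x)) - b

    random.seed(0)
    w, b, width = [1.0, 0.0, 0.0], 0.5, 0.01     # a unit-normal hyperplane
    points = [[0.5 + random.uniform(-width, width),
               random.random(), random.random()]
              for _ in range(8)]                 # all within width of the plane
    lam = [random.random() for _ in points]
    total = sum(lam)
    lam = [x / total for x in lam]               # convex coefficients
    combo = [sum(l * p[i] for l, p in zip(lam, points)) for i in range(3)]
    assert abs(signed_dist(combo, w, b)) <= width + 1e-12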

3.2.2.4 Complexity of the algorithm 

The computational complexity is equal to the time necessary to compute a hypothesis
consistent with the 2kl constraints. We can solve this constraint problem using linear
programming in time polynomial in the number of constraints, 2kl, and the precision of
each constraint, lg(1/width). The worst-case bound for solving the constraint problem
with 2kl constraints and bit precision lg(1/width) requires time O((2kl)^{3.5} · lg(1/width)).
Since 2kl is O(n^2), and lg(1/width) is O(s/ε), we can find a hypothesis consistent with all
2kl constraints in time O(n^7 s/ε).

■
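
To illustrate this final step (a sketch of ours using an off-the-shelf LP solver, not the thesis's implementation), the 2kl endpoint constraints can be written as a standard linear program; maximizing a margin variable t yields a strictly separating hyperplane whenever one exists:

    import numpy as np
    from scipy.optimize import linprog

    def consistent_hyperplane(pos, neg):
        """Find (w, b) with w . p >= b + t for each positive endpoint p and
        w . q <= b - t for each negative endpoint q, maximizing the margin t.
        Variables x = (w, b, t); w is box-bounded to keep the LP bounded."""
        pos, neg = np.asarray(pos, float), np.asarray(neg, float)
        n = pos.shape[1]
        c = np.zeros(n + 2)
        c[-1] = -1.0                                        # minimize -t
        A_pos = np.hstack([-pos, np.ones((len(pos), 2))])   # -w.p + b + t <= 0
        A_neg = np.hstack([neg, -np.ones((len(neg), 1)),
                           np.ones((len(neg), 1))])         #  w.q - b + t <= 0
        A = np.vstack([A_pos, A_neg])
        bounds = [(-1, 1)] * n + [(None, None), (0, None)]
        res = linprog(c, A_ub=A, b_ub=np.zeros(len(A)), bounds=bounds)
        return res.x[:n], res.x[n], res.x[n + 1]

    # Toy 2-d example: endpoints straddling the line x0 = 0.5.
    w, b, t = consistent_hyperplane(pos=[[0.6, 0.2], [0.7, 0.9]],
                                    neg=[[0.4, 0.1], [0.3, 0.8]])
    assert t > 0    # a strictly separating hypothesis exists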




Chapter 4 

Conclusions and Open Problems 



We have shown that membership queries do not significantly help us to pac-learn a wide
variety of classes. We have also shown that a priori knowledge of a smooth probability
distribution does not help us to pac-learn many concept classes when we are only allowed
to draw random examples. A smooth probability distribution, however, does help us to
pac-learn certain of these classes when we can also make membership queries. In particular,
we have shown a significantly reduced upper bound on the number of samples needed to
learn I, which is a concept class in one dimension, and S, which is a concept class in n
dimensions. We hope to generalize this upper bound to handle many other classes.

The following open problems arise from this research: 

1. Improve the lower bound of Theorem 1 to depend upon VCdim(C). 

2. Generalize the n-dimensional approach of Theorem 4 to handle (n − 1)-dimensional
hyperplanes intersecting arbitrary bounded convex n-dimensional domains (e.g., hy-
perspheres or hypercubes).

3. In regard to solving problem 2 above, we hypothesize that it is possible to make a reduction
from our problem to the problem solved by Maass and Turán [15] of exactly learning
a target hyperplane in a boolean-valued n-dimensional sample space. In making the
reduction we would capitalize on the fact that we are working with a smooth proba-
bility distribution, and so simulate a boolean-valued domain in our continuous space.
We would choose a small enough value for the distance between the boolean-valued
points that, if an algorithm exactly learned the target hyperplane, the error rate
in the continuous space would be less than ε. This reduction has not been worked out,
but if correct, it would give good bounds on learning half-spaces in n-dimensional space.






Chapter 5 

Acknowledgments 



I first of all thank Ron Rivest. He has been a superb and supportive advisor. 

I thank Avrim Blum, Mic Grigni, Alex Ishii, Joe Kilian, James Park, Rob Schapire, and 
Robert Solovay for their helpful discussions and comments. 

I thank Jonathan Amsterdam, Aditi Dhagat, Sally Goldman, Michael Kearns, Norman 
Margolus, and Rob Schapire for proof-reading my paper and giving me advice on my COLT 
talk. 

Thanks especially to Tandy Warnow for always being such a nurturing and supportive 
friend-teacher, and to Phil Rogaway for always giving me such good advice. 

Special thanks to Be Hubbard both for all her help to me personally and for helping to 
create and maintain the warm, supportive, thriving theory community. 

I also thank all my other wonderful friends not already mentioned above, including: 
Wendy Brunzie, Carolyn Cox, Ann Feehan, Shalom Franck, Debra Goldentyer, Deborah 
Greif, Agiua Heath, Maya Kopell, Suzanne Lawrence, Mojdeh Mohteshemi, Sharon Port, 
Karen Saracik, Mark Schaeffer, Michal Shimshoni, Mona Singh, Julie Sweedler, Margaret 
and Mark Tuttle, Debbie Utley, Debra Waller, Jill Werman, Su-Ming Wu, and Guillermo 
Zemba. 

Most of all, I thank my wonderful parents for all their love and support. 






Bibliography 



[1] Dana Angluin. Types of queries for concept learning. Technical Report 
YALEU/DCS/TR-479, Yale University Department of Computer Science, June 1986. 

[2] Dana Angluin. Learning regular sets from queries and counterexamples. Information 
and Computation, 75:87-106, November 1987. 

[3] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319-342, April 
1988. 

[4] Dana Angluin, Lisa Hellerstein, and Marek Karpinski. Learning read-once formulas 
with queries. Technical report, University of California, Report No. UCB/CSD 89/528, 
1989. 

[5] Eric B. Baum. On learning a union of half spaces. Journal of Complexity, 6(1):67-101, 
March 1990. 

[6] Shai Ben-David and Gyora M. Benedek. Measurability constraints on pac learnability. 
Technical report, Technion, Haifa, 1991. 

[7] Shai Ben-David, Gyora M. Benedek, and Yishay Mansour. A parametrization scheme 
for classifying models of learnability. In Proceedings of COLT '89, pages 285-302. 
Morgan Kaufmann, 1989. 

[8] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. 
Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In 
Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, pages 
273-282, Berkeley, California, May 1986. 

[9] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929-965,
1989.






[10] Thomas M. Cover. Geometrical and statistical properties of systems of linear inequalities
with applications to pattern recognition. IEEE Transactions on Electronic
Computers, EC-14(3):326-334, 1965.

[11] H. S. M. Coxeter. Regular Polytopes. Dover Publications, Inc., New York, second 
edition, 1973. 

[12] Andrzej Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general
lower bound on the number of examples needed for learning. Information and
Computation, 82(3):247-251, September 1989.

[13] David Haussler. Probably approximately correct learning. Technical Report UCSC- 
CRL-90-16, University of California Santa Cruz, May 1990. 

[14] Michael Kearns and Leslie G. Valiant. Cryptographic limitations on learning boolean
formulae and finite automata. In Proceedings of the Twenty-First Annual ACM Symposium
on Theory of Computing, pages 433-444, Seattle, Washington, May 1989.

[15] Wolfgang Maass and György Turán. How fast can a threshold gate learn? In Drastal,
Hanson, and Rivest, editors, Computational Learning Theory and Natural Learning
Systems: Constraints and Prospects. MIT Press. (To appear.)

[16] Leonard Pitt. Recent results in computational learning theory, 1990. 

[17] Leonard Pitt and Leslie G. Valiant. Computational limitations on learning from
examples. Technical report, Harvard University Aiken Computation Laboratory, July
1986.

[18] Ronald L. Rivest. Learning decision lists. Machine Learning, 2(3):229-246, 1987. 

[19] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 
27(11):1134-1142, November 1984. 

[20] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag,
New York, 1982.


